Introduction

The goal of this project is to predict whether a savings customer will take out a credit. We have several sources of data, including savings account transactions, ZIP codes, geographical and transactional ATM information, and open data on crime and sociodemographic areas in Mexico.

In this document we present an exploratory data analysis of the available information. Banco Azteca (BAZ) has around 12 million savings customers and 800 thousand customers with both credit and savings, of which we have a sample of 1 million savings customers. The analysis in this document is based on this sample and the whole population of credit customers.

Variable Analysis

Transactions and Amount

quantile abonos abonos_monto retiros retiros_monto num_meses tiempo_meses freq
0% 0 0.000 0 -39804335.20 1 1 0.0312500
5% 1 1.000 0 -135549.10 1 6 0.0526316
10% 1 50.000 0 -80000.00 1 8 0.0714286
15% 1 150.000 1 -55267.97 2 11 0.0967742
20% 1 554.000 1 -40600.00 2 13 0.1250000
25% 2 1212.975 2 -30724.68 3 14 0.1428571
30% 2 2075.000 2 -23850.00 3 16 0.1666667
35% 3 3200.000 3 -18616.00 4 18 0.2000000
40% 3 4636.000 4 -14750.00 4 19 0.2222222
45% 4 6200.000 5 -11500.00 5 21 0.2500000
50% 5 8350.000 7 -9006.00 5 23 0.2812500
55% 6 10800.000 8 -6940.00 6 25 0.3125000
60% 8 14000.000 10 -5170.00 7 27 0.3333333
65% 9 17850.000 12 -3897.79 8 29 0.3600000
70% 11 22917.640 15 -2650.00 9 30 0.3750000
75% 14 29950.000 19 -1650.00 10 31 0.3846154
80% 18 39550.000 25 -900.00 10 32 0.4210526
85% 24 54000.000 32 -300.00 11 32 0.5000000
90% 33 78500.000 45 0.00 12 32 0.5882353
95% 53 132810.063 72 0.00 12 32 0.7500000
100% 60199 42131739.960 22009 0.00 12 32 1.0000000
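A quantile table in 5% steps like the one above can be produced with a few lines of pandas. The following is only a sketch: the data frame `customers` and its values are synthetic, and only the column names are borrowed from the table.

```python
import numpy as np
import pandas as pd

# Synthetic per-customer summary (hypothetical values); the column names
# follow the table above.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "abonos": rng.poisson(10, size=1000),
    "abonos_monto": rng.gamma(2.0, 5000.0, size=1000),
})

# Quantiles from 0% to 100% in steps of 5%, one row per quantile.
probs = np.linspace(0, 1, 21)
quantile_table = customers.quantile(probs)
quantile_table.index = [f"{round(p * 100)}%" for p in probs]
print(quantile_table)
```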

Gender

Tenure

Salary based on transactions

Number of months / Total months: a value between 0 and 1. A value of 1 means that the customer had activity in every month available in the data; a value of 0 would mean that no activity took place. A value of 0 is not possible in this database because all of these customers have at least one transaction.
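As a small illustration of this ratio, the sketch below assumes that `num_meses` holds the number of months with activity and `tiempo_meses` the number of months available for the customer; this reading of the column names is an assumption, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical customers. Assumed meaning of the columns:
#   num_meses    = months in which the customer had any activity
#   tiempo_meses = months available in the data for that customer
df = pd.DataFrame({
    "num_meses": [1, 5, 12],
    "tiempo_meses": [32, 23, 12],
})

# Activity frequency: share of available months with at least one transaction.
df["freq"] = df["num_meses"] / df["tiempo_meses"]
print(df)
```

Under this reading, a customer active in 1 of 32 months gets 1/32 = 0.03125, which is consistent with the smallest `freq` value in the quantile table.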

credit_or_savings electronic_banking proportion
credit 0 0.9509844
credit 1 0.0490156
savings 0 0.9553800
savings 1 0.0446190

credit_or_savings active_electronic_banking proportion
credit 0 0.9686378
credit 1 0.0313622
savings 0 0.9772680
savings 1 0.0227310

Months with more activity

Histogram of date of first usage of credit

It can be seen that the number of people taking credits has been decreasing.

Geography

We have information about the customers’ ZIP code. This information could be combined with publicly available information from sources like INEGI to estimate the socioeconomic level of each savings customer.

Available sources:

AGEB stands for Área GeoEstadística Básica (Basic Geostatistical Area), and a locality is a general term used by CONAPO to define several AGEBs.

This document uses information from the socioeconomic regions defined by INEGI and the marginalization index by locality defined by CONAPO.

ZIP code geographical information is available. According to the official postal code webpage, there are 32,448 different ZIP codes in Mexico, of which around 25,000 are available as shapefiles. The official ZIP code shapefiles are published on the government’s open data webpage, but not all of them are available yet; the Mexican postal service is still working on defining the boundaries of each code. Other resources are available, for example, an unofficial collection of shapefiles of neighborhoods and ZIP codes. In addition, Google’s geocoding API is a useful tool, used as a last resort to find information about some ZIP codes.

Even with all this information, one problem remains: a number of ZIP codes aren’t officially assigned to any human settlement but are still used by people out of tradition or misinformation. So geographic information may not be available for all customers, but it will be for most of them.

Problem:

The polygons defining the ZIP codes aren’t equivalent to the polygons defining the AGEBs and localities, so a mapping between them is needed in order to use the publicly available information. Perhaps the simplest solution is to find the centroid of each ZIP code and of each AGEB or locality, and then map each ZIP code to the closest AGEB or locality centroid.
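The nearest-centroid idea can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes centroids are given as (longitude, latitude) pairs in decimal degrees and measures distance with the haversine formula on a spherical Earth. The function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_centroid(zip_lonlat, ageb_lonlat):
    """For each ZIP centroid, return the index of the nearest AGEB centroid
    and the distance to it in km."""
    zip_lonlat = np.asarray(zip_lonlat, dtype=float)
    ageb_lonlat = np.asarray(ageb_lonlat, dtype=float)
    # Pairwise distance matrix of shape (n_zip, n_ageb) via broadcasting.
    d = haversine_km(zip_lonlat[:, None, 0], zip_lonlat[:, None, 1],
                     ageb_lonlat[None, :, 0], ageb_lonlat[None, :, 1])
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(zip_lonlat)), idx]

# Illustrative centroids as (longitude, latitude).
zips = [(-98.93143, 19.44496), (-89.70512, 21.01598)]
agebs = [(-98.93469, 19.44372), (-89.74094, 21.02427), (-99.24014, 19.45617)]
idx, dist_km = nearest_centroid(zips, agebs)
print(idx, dist_km)
```

The full pairwise matrix is fine for a few thousand centroids; for the complete set of ZIP codes and AGEBs it may be too large to hold in memory, in which case processing the ZIP codes in chunks or using a spatial index would be a reasonable alternative.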

AGEB classification:

We have a classification for each AGEB that aims to show the differences among AGEBs based on indicators related to housing, education, health and employment, built from the last population census. Each AGEB is classified into one of 7 strata, such that stratum 7 contains the AGEBs with the most favorable average conditions and stratum 1 contains the AGEBs with the least favorable average conditions.

In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.

Map of Mexico City with centroids of each polygon:

Now, the same map for Guadalajara, Jalisco:

And finally, for Monterrey, Nuevo León:

ZIP code information with their centroids can be seen in the next map of Mexico City:

ZIP code information with their centroids can be seen in the next map of Guadalajara. Some of the centroids may not perfectly match the plotted polygon because the database treats the ZIP code and the identifier as different groups.

ZIP code information with their centroids can be seen in the next map of Monterrey:

Finally, plotting the centroids of AGEBs and ZIP codes in Mexico City together we get:

Guadalajara:

Monterrey:

So, for each available ZIP code, the closest AGEB centroid is found and a mapping is made to assign an AGEB to each ZIP code, such that we get a table in the following format:

ZIP    ZIP long    ZIP lat   Nearest AGEB  AGEB long   AGEB lat  Distance (km)  Classification
56364  -98.93143   19.44496  1.503100e+12  -98.93469   19.44372  0.3680725      3
56367  -98.95076   19.44106  1.503100e+12  -98.94869   19.43894  0.3201608      4
56365  -98.94247   19.43852  1.503100e+12  -98.94134   19.43799  0.1325068      4
96340  -94.60759   18.00084  3.004801e+12  -94.60721   18.00117  0.0547658      6
42850  -99.33818   19.92243  1.306300e+12  -99.33511   19.91824  0.5655460      6
57850  -98.97560   19.38088  1.505800e+12  -98.97690   19.38002  0.1661747      6
97300  -89.70512   21.01598  3.110000e+12  -89.74094   21.02427  3.8302693      2
61531  -100.37365  19.42391  1.611200e+12  -100.37496  19.42216  0.2384809      4
41706  -98.41225   16.69447  1.204600e+12  -98.40838   16.69271  0.4568835      4
53750  -99.24115   19.45593  1.505700e+12  -99.24014   19.45617  0.1088229      6

In the following graph, a histogram shows the distribution of the distance between the centroid of each ZIP code and the centroid of its assigned AGEB. The red lines mark the 0.5, 0.75, 0.9 and 0.95 quantiles. As can be seen, most of the mass is concentrated at distances shorter than 10 km. This may seem small, but within a city the landscape can change dramatically in 10 km.

In the following graph, the distance histogram is plotted once more, but with a separate panel depending on whether the ZIP code is in a rural, urban, semiurban or unknown type of area. In the urban and semiurban areas, more than 95% of ZIP codes are within 2.5 km of the closest centroid. The rural areas are the ones with the longest tail, which seems reasonable because rural areas are usually larger and AGEB information is scarce in these areas.
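The per-area-type comparison can also be summarized numerically. The sketch below assumes a mapping table with one row per ZIP code and hypothetical columns `area_type` and `distance_km`; the values are synthetic and chosen only to show the computation.

```python
import pandas as pd

# Synthetic mapping table (hypothetical values), one row per ZIP code.
mapping = pd.DataFrame({
    "area_type": ["urban", "urban", "semiurban", "rural", "rural", "unknown"],
    "distance_km": [0.37, 0.13, 1.9, 3.8, 12.4, 0.6],
})

THRESHOLD_KM = 2.5  # threshold mentioned in the text

# Flag ZIP codes whose nearest centroid lies within the threshold.
mapping["within_threshold"] = mapping["distance_km"] <= THRESHOLD_KM

grouped = mapping.groupby("area_type")
summary = pd.DataFrame({
    # Share of ZIP codes within the threshold, per area type.
    "share_within": grouped["within_threshold"].mean(),
    # 95th percentile of the distance, per area type.
    "q95_km": grouped["distance_km"].quantile(0.95),
})
print(summary)
```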

The following graph shows the distribution of the distance in the 4 main states in Mexico.

The next graph combines the data of the last two graphs: it shows the distance distribution depending on whether the area is rural, urban, semiurban or unknown and on whether the ZIP code is in any of the 4 biggest states in Mexico. Once more, in the urban and semiurban areas the distance is smaller than in rural areas.

This approach may fail in the rural areas and also, as can be noted, ZIP code polygons are generally bigger in area than AGEBs, so the heterogeneity of each ZIP code is being ignored.

Locality classification:

CONAPO (Consejo Nacional de Población, National Population Council) publishes a marginalization index by locality, defining marginalization as the set of social problems or disadvantages of a community or locality. The index aims to summarize characteristics of the environment in which people live, using information on:

  • Percentage of population 15 years old or more who cannot read or write
  • Percentage of population 15 years old or more who haven’t completed primary education
  • Percentage of households without a toilet
  • Percentage of households without electricity
  • Percentage of households without tap water
  • Average number of people living in a room
  • Percentage of households with dirt floor
  • Percentage of households without a refrigerator

The index is computed using Principal Component Analysis: the index is the first component, which is best explained by the absence of a refrigerator, the percentage of people without primary education and the percentage of people who can’t read or write. This index is then classified into five categories of marginalization: very low, low, medium, high and very high.
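To make the construction concrete, here is a minimal sketch of an index built as the first principal component of standardized indicators. It is not CONAPO’s actual procedure: the data are synthetic, the function name is hypothetical, and the five categories are cut at equal-frequency quantiles purely as a stand-in, since the real cut points are not described here.

```python
import numpy as np

def first_component_index(X):
    """Standardize the columns of X and return the scores on the first
    principal component, its loadings and its share of explained variance."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # assumes no constant column
    # SVD of the standardized matrix: rows of Vt are principal directions.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[0]
    # The sign of a component is arbitrary; orient it so that higher
    # indicator values give a higher index.
    if loadings.sum() < 0:
        loadings = -loadings
    scores = Z @ loadings
    explained = S[0] ** 2 / np.sum(S ** 2)
    return scores, loadings, explained

# Synthetic data: 200 localities, 8 indicators driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(8)])

scores, loadings, explained = first_component_index(X)

# Stand-in classification: five equal-frequency bins of the index.
labels = np.array(["very low", "low", "medium", "high", "very high"])
cuts = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
category = labels[np.digitize(scores, cuts)]
```

The loadings show how much each indicator contributes to the index, which is the sense in which the first component is “explained” by particular variables.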

In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.

Map of Mexico City with centroids of each polygon:

Now, the same map for Guadalajara, Jalisco:

And finally, for Monterrey, Nuevo León:

It can be seen in the previous maps that the localities in the main cities in Mexico are much bigger than the AGEBs and the ZIP codes, so the information provided by CONAPO may not give a good estimate of the socioeconomic level of each ZIP code, because a single locality-level value hides the differences among the ZIP codes it contains.

Customer analysis

First, let’s look at the distribution of the AGEB classification across the country. Recall that stratum 7 means the AGEB is “good” on average and stratum 1 means it is “bad”.

And now, the mapping of the ZIP codes:

The distribution changed considerably. As we can see in the following graph, the original AGEBs were both urban (U) and rural (R), but the mapping consists of only urban ZIP codes, which may be one reason why the distribution changed so much.

And now let’s analyze the sample of 1 million savings customers and about 800 thousand credit customers.

Out of the 1,859,441 customers, we have a ZIP code mapping for 1,590,674 (about 85.5%), which are distributed as follows:

And now, conditioning on whether it’s a credit or savings customer:

Crime Rate

Using information about crime reports, we create four indexes that together give us a picture of crime in each region. The indexes that we produce are: